[codex] Fix reusable eval artifact validation#1841
Merged
Conversation
Contributor
|
Claude finished @Oseltamivir's task in 1m 2s —— View job Reviewing PR #1841
LGTM — no blocking issues found. The approach is sound: replacing brittle prefix-matching against physical runner names with logical identity extraction from |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
meta_env.jsonlogical matrix identitymi300xjobs running onmi300x-amds_04perf-changelog.yamlis unchangedRoot cause
Merge run 27798519988 reused a successful PR sweep, but job 82263580083 rejected its eval artifacts. The validator built expected artifact prefixes with the logical config runner (
mi300x), while artifact names contain the physical runner (mi300x-amds_01ormi300x-amds_04). Valid artifacts were therefore reported as both missing and unexpected.The initial PR check also failed after all tests passed because
test-changelog-gate.ymlunconditionally validated the associated PR's changelog diff. This PR does not modifyperf-changelog.yaml, so the validator correctly reported that the file had no changes. The workflow now skips only that final validation when the PR has no changelog diff.Impact
Merge-time ingest can now reuse successful eval artifacts from pooled self-hosted runners without depending on runner-name formatting. Missing, unexpected, and duplicate eval identities are still rejected using the artifact metadata. Changelog-changing PRs retain full append-only validation.
Validation
uv run --with pytest --with pydantic --with pyyaml python -m pytest utils/changelog_gate_tests/test_validate_perf_changelog.py utils/test_find_reusable_sweep_run.py utils/changelog_gate_tests/test_prepare_perf_changelog_merge.py utils/changelog_gate_tests/test_recover_failed_ingest.py utils/test_validate_reusable_sweep_artifacts.py utils/changelog_gate_tests/test_run_sweep_gating.py -v(97 passed)uv run --with pytest --with pydantic --with pyyaml python -m pytest utils/matrix_logic/ -v(156 passed)actionlint .github/workflows/test-changelog-gate.yml27794193288(passed)Note
Medium Risk
Touches merge/reuse artifact gating and CI changelog checks for sweep workflows; incorrect logic could allow bad reuse or skip needed validation, but scope is limited to eval artifact matching and optional PR changelog diff.
Overview
Fixes merge-time reuse rejecting valid eval artifacts when directory names use physical runner labels (e.g.
mi300x-amds_04) while the matrix expects logical runners (mi300x).Reusable sweep eval validation no longer matches expected
eval_*directory prefixes. It reads each raw eval dir’smeta_env.json, builds the same logical identity as aggregates (expected_eval_keys/eval_key), and still enforces exact coverage for raw dirs andeval_results_all(missing, unexpected, duplicates).eval_keyalso acceptsinfmax_model_prefixwhenmodel_prefixis absent.Test changelog workflow skips the associated-PR
validate_perf_changelog.pystep when the PR diff does not touchperf-changelog.yaml, so infra-only PRs are not failed for “no changelog changes.”Tests were updated with
meta_env.json-based fixtures (including logicalmi300xon physicalmi300x-amds_04), looser assertions on error text, and a workflow regression for the changelog skip.Reviewed by Cursor Bugbot for commit 44421d2. Bugbot is set up for automated code reviews on this repo. Configure here.